gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649
pablogsal wants to merge 9 commits into
Conversation
Use exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the profiler page cache. This matches the core change from python#149585.
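A rough sketch of the idea (the real change is in C; `RemoteReader` and its methods are invented here for illustration): reading exactly the bytes a fixed-size struct needs avoids both the 4 KiB copy and the cache bookkeeping that a full-page read pays for.

```python
# Hypothetical model: the old page-cached read vs. the new exact read.
PAGE_SIZE = 4096

class RemoteReader:
    def __init__(self, memory: bytes):
        self.memory = memory          # stands in for the target process's memory
        self.bytes_fetched = 0        # bytes actually copied from the "remote" side
        self.page_cache = {}

    def read_page_cached(self, addr: int, size: int) -> bytes:
        """Old behavior: fault the whole page containing addr into the cache."""
        page = addr // PAGE_SIZE
        if page not in self.page_cache:
            start = page * PAGE_SIZE
            self.page_cache[page] = self.memory[start:start + PAGE_SIZE]
            self.bytes_fetched += PAGE_SIZE
        off = addr % PAGE_SIZE
        return self.page_cache[page][off:off + size]

    def read_exact(self, addr: int, size: int) -> bytes:
        """New behavior: copy exactly `size` bytes, bypassing the page cache."""
        self.bytes_fetched += size
        return self.memory[addr:addr + size]

mem = bytes(range(256)) * 64          # 16 KiB of fake remote memory
r = RemoteReader(mem)
assert r.read_page_cached(0x100, 64) == r.read_exact(0x100, 64)
# Same 64 bytes either way, but the exact read moved 64 bytes
# where the page-cached read moved 4096.
```

For small structs read once per sample, the page cache never amortizes, so the exact read wins on both copied bytes and cache-maintenance work.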
The profiler clears the page cache between samples, so live entries are always packed at the front. Track the live count and only clear/search that prefix instead of scanning all 1024 slots on the hot path.
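The live-prefix trick can be sketched like this (the `PageCache` class and entry layout are invented for illustration; only the 1024-slot count comes from the text above). Because entries are always packed at the front, both lookup and clear stop at `used` instead of walking the full array.

```python
SLOTS = 1024

class PageCache:
    def __init__(self):
        self.pages = [None] * SLOTS   # (page_number, data) entries, packed at the front
        self.used = 0                 # count of live entries

    def lookup(self, page_number):
        # Only the first `used` slots can be live, so stop there instead of
        # scanning all 1024 slots on the hot path.
        for i in range(self.used):
            if self.pages[i][0] == page_number:
                return self.pages[i][1]
        return None

    def insert(self, page_number, data):
        if self.used < SLOTS:
            self.pages[self.used] = (page_number, data)
            self.used += 1

    def clear(self):
        # The tail past `used` is already None, so clearing the prefix
        # resets the whole cache between samples.
        for i in range(self.used):
            self.pages[i] = None
        self.used = 0
```

Since a typical sample touches far fewer than 1024 pages, the per-sample clear drops from O(1024) to O(live entries).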
@maurycy do you mind reviewing this PR?
Use the frame cache to predict the next thread state and top frame address, then batch interpreter/thread/frame reads with process_vm_readv when profiling a Linux target. Reuse prefetched frame buffers in the frame walker when the prediction is valid.
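A portable sketch of the predict-then-batch shape (the `Sampler` class, its method names, and the 64-byte interpreter-state read are all invented here; in the PR the batched read is a real `process_vm_readv()` call on Linux):

```python
INTERP_READ = 64                          # assumed interpreter-state slice size

class Sampler:
    def __init__(self, read_batch):
        self.read_batch = read_batch      # fn: list[(addr, size)] -> list[bytes]
        self.predicted = None             # (tstate_addr, frame_addr) from last sample
        self.prefetched = {}

    def sample(self, interp_addr, tstate_size, frame_size):
        spans = [(interp_addr, INTERP_READ)]
        if self.predicted is not None:
            tstate_addr, frame_addr = self.predicted
            spans += [(tstate_addr, tstate_size), (frame_addr, frame_size)]
        bufs = self.read_batch(spans)     # up to three structs, one batched call
        self.prefetched = dict(zip((a for a, _ in spans), bufs))
        return bufs[0]

    def frame_bytes(self, addr, size, read_one):
        # Frame walker: reuse the prefetched buffer when the prediction held,
        # fall back to a dedicated read otherwise.
        buf = self.prefetched.get(addr)
        return buf if buf is not None else read_one(addr, size)

calls = []
def fake_read_batch(spans):
    calls.append(len(spans))              # record spans per batched call
    return [bytes(size) for _, size in spans]

s = Sampler(fake_read_batch)
s.sample(0x1000, 512, 120)                # first sample: nothing to predict yet
s.predicted = (0x2000, 0x3000)            # frame cache guesses the next addresses
s.sample(0x1000, 512, 120)                # three structs, still one batched call
assert calls == [1, 3]
assert s.frame_bytes(0x3000, 120, lambda a, n: bytes(n)) == bytes(120)
```

When the prediction misses (the thread switched frames), the walker simply falls back to individual reads, so a wrong guess costs nothing beyond the prefetched bytes.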
Cache the last FrameInfo tuple per code object/instruction offset, reuse cached thread id objects, and append cached parent frames directly on full frame-cache hits. This cuts Python allocation churn in the steady-state profiler path.
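The memoization here can be sketched in a few lines (`frame_info` and `build_frame` are illustrative names, not the PR's actual C helpers): keying on the code object and instruction offset means a steady-state sample returns the exact same tuple object instead of allocating a fresh one.

```python
_frame_memo = {}

def frame_info(code_id, instr_offset, build):
    """Return the cached frame tuple for (code object, instruction offset),
    building it only the first time the pair is seen."""
    key = (code_id, instr_offset)
    cached = _frame_memo.get(key)
    if cached is None:
        cached = _frame_memo[key] = build(code_id, instr_offset)
    return cached

built = []
def build_frame(code_id, off):
    built.append((code_id, off))
    return ("app.py", 10 + off, "work")   # stand-in for a FrameInfo tuple

a = frame_info(7, 2, build_frame)
b = frame_info(7, 2, build_frame)
assert a is b            # identical object: no allocation on the hot path
assert built == [(7, 2)] # builder ran exactly once
```

Reusing cached thread id objects follows the same pattern, which is why the win shows up mainly when frame-cache hits dominate.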
@pablogsal Awesome job, I'm delighted :) Definitely will do! Perhaps it's worth an entry in the "Optimizations" section? One thing that I'm wondering immediately is extending
Added async benchmark modes so the benchmark harness can exercise
Since we are in beta1, I think we are good for now. Later makes sense.
Can you give it another go? |
@pablogsal Left two more comments:
(If they do not show, you have to check the resolved threads; the comments are follow-ups.)
Will take a look in 1h |
@pablogsal One last thing (in the current round): before-vs-after numbers for the newly added async benchmarks in the description, to see the gain and for future reference (for
I'm traveling, so I don't have access to the same box where I ran the original ones, so I will need to do these on a different one :)
Thank you. (TailScale FTW :)) |
Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.
This PR improves the profiler by:

- using exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the page cache
- tracking the live page-cache entry count so only the used prefix is cleared and searched
- batching interpreter/thread/frame reads with `process_vm_readv()` on Linux
- reusing cached `FrameInfo` and thread id objects when frame-cache hits dominate

The `last_profiled_frame` remote-write suppression is already present in current `upstream/main`, so this branch keeps that baseline behavior and builds on top of it.

Benchmark
Benchmarked with:
For the per-commit measurements, I used the same benchmark workload in quiet mode with `cache_frames=True` and `all_threads=True`.

`upstream/main` baseline:

Final benchmark using the benchmark script:
Final benchmark output:
_remote_debugging: reading whole pages over and over #149584